1. INTRODUCTION

Data mining is the process of extracting knowledge from large amounts of data [1]-[4]. In developing a
data mining application, the amount of data taken from various repositories, such as databases, data
warehouses, and the World Wide Web (WWW), is typically too large to be either stored or processed. Analyzing
complex data and mining huge amounts of data may take a long time, which sometimes makes such analysis
impractical or infeasible. Data reduction techniques are traditionally applied to find a reduced
representation of the dataset that is much smaller in size yet closely preserves the integrity of the original
data. Consequently, mining on the reduced dataset should be more efficient while producing the same or almost
the same analytical results. The common strategies for data reduction include data cube aggregation, attribute
subset selection, dimensionality reduction (DR), and numerosity reduction [1].

Recently, dataset sizes, in terms of the number of records and attributes, have been growing very rapidly,
which has prompted the development of a number of big-data platforms and parallel data analytics algorithms,
as well as the efficient use of DR procedures. To handle real-world data effectively, its dimensionality needs
to be reduced to a more economical size. DR is the study of transformations that reduce the number of
dimensions describing high-dimensional data, yielding a meaningful representation of reduced dimensionality.
Theoretically, the reduced representation of a dataset should have a dimensionality that corresponds to the
intrinsic dimensionality of the dataset, that is, the minimum number of parameters needed to account for the
observed properties of the data. The general objectives of DR are to remove irrelevant and redundant data in
order to reduce the manipulation cost and avoid over-fitting, and to increase the quality of data for
efficient data-intensive processing tasks, such as pattern recognition, data mining, visualization, database
navigation, and compression of high-dimensional data. As such, DR offers an effective solution to the “curse
of dimensionality” and mitigates other undesired properties of high-dimensional spaces [5].

Mathematically, DR converts a given dataset, represented as an $n \times D$ matrix $X$ consisting of $n$ data
vectors $x_i$, $i = 1, 2, \ldots, n$, with dimensionality $D$, into a new dataset $Y$ with dimensionality $d$,
where $d$ is the intrinsic dimensionality of the data, $d < D$, and often $d \ll D$. The intrinsic
dimensionality signifies that the points in dataset $X$ lie on or near a manifold with dimensionality $d$ that
is embedded in the $D$-dimensional space. In other words, DR methods encode the given dataset $X$ having
dimensionality $D$ into a new dataset $Y$ with dimensionality $d$ while retaining the geometry of the data as
much as possible. In general, neither the intrinsic dimensionality $d$ of the dataset $X$ nor the geometry of
the data manifold is completely known. Therefore, DR of a dataset is an ill-posed problem that can only be
solved by assuming certain properties of the data, such as its intrinsic dimensionality [5].

Some DR techniques are intended for producing smaller images and for compression, while others serve machine
learning purposes (e.g., better data analysis, classification, statistics, and visualization) [6]. In machine
learning, dimension reduction is usually concerned with the feature vectors. In this case, DR techniques can
be divided into two categories: feature extraction and feature selection methods. Feature extraction can
further be divided into linear and non-linear methods. The main goal of some methods is to preserve fidelity
with respect to the original data using a certain metric such as mean squared error, whereas the goal of other
methods is to improve the performance of a specific task, such as classification, prediction, or visualization
[7]. Linear feature extraction methods include principal component analysis (PCA), factor analysis,
independent component analysis (ICA), and linear discriminant analysis (LDA). Non-linear feature extraction
methods include front-ranked techniques such as multidimensional scaling (MDS), Isomap, maximum variance
unfolding, and kernel PCA [5]. Feature selection is divided into feature ranking and feature subset selection.
Feature ranking commonly uses scoring functions such as Euclidean distance, correlation, and information gain
ratio. Feature subset selection methods, on the other hand, are divided into filter, wrapper, and embedded
methods. The filter methods do not use any learning algorithm [8].

In this paper, after conducting a comprehensive study on the DR techniques, we present a face recognition
approach using PCA transformation. We perform experiments using the Olivetti Research Laboratory (ORL) and
Yale face databases. The experimental results demonstrate the superiority of the proposed method. The main
contributions of this paper are: i) a comprehensive study on the DR techniques; ii) the technical and
mathematical intuitions behind the PCA approach; iii) two face recognition proposals using PCA data; and iv) a
performance evaluation on the ORL and Yale face databases.

The remainder of this paper is organized as follows. We provide the technical details of the PCA method in
section 2. Then, we discuss the works related to ours in section 3. After that, we explain the proposed face
recognition approach in section 4. The experiments and results are provided in section 5. Finally, we
summarize and conclude the findings and observations in section 6.


2. PRINCIPAL COMPONENT ANALYSIS 

The constituent attributes of a real-world dataset are usually related to one another. The relationships are
often linear or approximately linear, which makes the attributes amenable to common analysis techniques. One
such technique is PCA, which rotates the original data to new coordinates with a view to making the data as
flat as possible. PCA is a statistical transformation that identifies patterns in data by detecting the
correlation between attributes [9]. Reducing the dimensionality makes sense only if there is a strong
correlation between attributes. PCA finds the directions of maximum variance in high-dimensional data and then
projects the data onto a lower-dimensional subspace while retaining most of the information of the original
dataset [10]. Mathematically, given a matrix of two or more attributes, PCA produces a new matrix with the
same number of attributes, called the principal components. Each principal component is a linear combination
of the original attributes. The principal components are calculated in such a way that the first principal
component holds the maximum variance, which can tentatively be thought of as the maximum information. The
second principal component has the second largest variance and, significantly, is linearly uncorrelated with
the first principal component. Any further principal components exhibit decreasing variance and are
uncorrelated with all other principal components. The steps for implementing PCA are as follows [11]:


— Step 1: Take the whole dataset consisting of $d$-dimensional samples, ignoring the class labels.

— Step 2: Compute the $d$-dimensional mean vector. The mean vector consists of the means of each variable.
The mean is the sum of the data points divided by the number of data points, that is,
$\mu = \bar{A} = \frac{1}{n}\sum_{i=1}^{n} A_i$. The mean is the value most commonly referred to as the
average, and the mean vector is often referred to as the centroid. The variance is roughly the arithmetic
average of the squared distance from the mean. It is defined as
$\sigma^2 = s^2 = \mathrm{var}(A) = \frac{1}{n-1}\sum_{i=1}^{n}(A_i - \bar{A})^2$, where $\bar{A}$ is the mean
of the data. Note that the standard deviation ($\sigma$) is the square root of the variance.

— Step 3: Compute the covariance matrix or, alternatively, the scatter matrix of the whole dataset.

a. Covariance matrix: The variance-covariance matrix consists of the variances of the variables along
the main diagonal and the covariances between each pair of variables in the other matrix positions. The
formula for computing the covariance of the variables $S$ and $T$ is
$\mathrm{covar}(S, T) = \frac{1}{n-1}\sum_{i=1}^{n}(S_i - \bar{S})(T_i - \bar{T})$, where $\bar{S}$ and
$\bar{T}$ denote the means of $S$ and $T$, respectively. The covariance matrix is defined as

$$\Sigma = \begin{bmatrix} \sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1d} \\ \sigma_{21} & \sigma_2^2 & \cdots & \sigma_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{d1} & \sigma_{d2} & \cdots & \sigma_d^2 \end{bmatrix}$$

Here, $\sigma_i^2$ is the variance of each variable $A_i$ in $A$, and $\sigma_{ij}$ is the covariance between
$A_i$ and $A_j$ in $A$.

b. Scatter matrix: The scatter matrix is computed as $\sum_{i=1}^{n}(A_i - m)(A_i - m)^T$, where $A_i$ here
denotes the $i$-th sample and $m$ is the mean vector, defined as $m = \frac{1}{n}\sum_{i=1}^{n} A_i$.


— Step 4: Perform eigendecomposition, i.e., compute the eigenvectors $(e_1, e_2, \ldots, e_d)$ and the
corresponding eigenvalues $(\lambda_1, \lambda_2, \ldots, \lambda_d)$. The eigenvectors, or principal
components, determine the directions of the new feature space, and the eigenvalues determine their magnitude.

— Step 5: Sort the eigenvectors by decreasing eigenvalues and choose the $k$ eigenvectors with the largest
eigenvalues to form a $d \times k$ matrix $W$, where every column represents an eigenvector and $k$ is the
number of dimensions of the new feature subspace, with $k < d$.

— Step 6: Use the $d \times k$ eigenvector projection matrix $W$ to transform the original samples onto the
new subspace. This can be summarized by the equation $y^T = W^T \times x^T$, where $x$ is a
$1 \times d$-dimensional vector representing one sample, and $y$ is the transformed $1 \times k$-dimensional
sample in the new subspace. Alternatively, this can be performed as $Y = A \times W$ (or $Y^T = W^T A^T$),
where $A$ is the $n \times d$ data matrix and $Y$ holds the transformed $n \times k$-dimensional samples in
the new subspace.
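
As an illustration, Steps 1-6 can be sketched in Python with NumPy. This is a minimal sketch under stated assumptions, not the implementation used in this paper: the function name pca, the $1/(n-1)$ covariance normalization, and the random example data are choices made here. Following Step 6, the samples are projected as $Y = A \times W$ without prior centering; subtracting the mean before projecting is a common variant.

```python
import numpy as np


def pca(A, k):
    """Project the n x d data matrix A onto its k leading principal components."""
    A = np.asarray(A, dtype=float)

    # Step 2: d-dimensional mean vector (one mean per attribute, i.e. per column).
    m = A.mean(axis=0)

    # Step 3: d x d covariance matrix with the 1/(n - 1) normalization (assumed here).
    # The scatter matrix differs from it only by the factor (n - 1).
    centered = A - m
    cov = centered.T @ centered / (A.shape[0] - 1)

    # Step 4: eigendecomposition. eigh is meant for symmetric matrices and
    # returns the eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(cov)

    # Step 5: sort by decreasing eigenvalue and keep k eigenvectors as columns of W.
    order = np.argsort(eigvals)[::-1][:k]
    W = eigvecs[:, order]

    # Step 6: Y = A x W, as written in the text.
    Y = A @ W
    return Y, W, eigvals[order]


# Example on random stand-in data: n = 100 samples with d = 5 attributes, k = 2.
rng = np.random.default_rng(0)
A = rng.normal(size=(100, 5))
Y, W, eigenvalues = pca(A, k=2)
print(Y.shape)  # (100, 2)
```

Because $W$ diagonalizes the covariance matrix, the columns of $Y$ are uncorrelated and their variances equal the selected eigenvalues, which matches the properties of the principal components described above.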


3. RELATED WORK 

Dash et al. presented a PCA-based entropy measure for ranking features and compared it with a similar
feature ranking method (Relief) in [12]. Maaten, Postma, and Herik investigated the performance of nonlinear
techniques on artificial and natural tasks, and also conducted a review and systematic comparison of DR
techniques [5]. Spectral DR methods are explained in a short tutorial in [13]. In the review work [14], the
authors categorized the plethora of available DR methods and illustrated the mathematical insight behind them.
Looga, Ginneken, and Duin proposed a DR technique for image features using the canonical contextual
correlation projection in [15]. In article [16], the authors provide a comprehensive review and comparison of
the performance of the principal methods of dimension reduction proposed in the approximate Bayesian
computation literature. Silipo, Adae, and Berthold discussed seven techniques for DR, namely missing values,
low variance filter, high correlation filter, PCA, random forests, backward feature elimination, and forward
feature construction, in [17]. Joshi and Machchhar conducted a comprehensive survey on DR methods and proposed
a DR method that depends upon the given set of parameters and varying conditions [18]. The authors of [19]
found that recursive feature elimination and genetic and evolutionary feature weighting and selection give
better classification results than PCA.

Several works have also addressed the recognition problem based on PCA in various ways. Huang and Yin
compared and investigated linear PCA and various nonlinear techniques for face recognition [20]. Alkandari and
Aljaber presented the importance of PCA in identifying facial images without human intervention [21]. Dandpat
and Meher proposed a face recognition approach with improved performance using PCA and two-dimensional PCA in
[22]. PCA in linear discriminant analysis space for face recognition was proposed by Su and Wang [23]. The
work in [24] investigates the performance obtained when two DR methods, self-organizing map (SOM) and PCA, are
combined.


4. PROPOSED APPROACH TO FACE RECOGNITION 

In this paper, after discussing the working principle of PCA in detail, we propose a solution to the face
recognition problem based on the principal components of the training grayscale face image matrices. The
proposal is a customization of various existing principal component-based classifiers. The main customization
lies in deriving the training and test sets: the images are placed as matrices rather than as vectors, as in
the traditional approaches, and the transposes of the main sets are introduced, as discussed later. To
implement the proposal, the face recognition problem is divided into two categories.


4.1. Problem statement-1: Recognition of a typical face 

Given a new image, classify it as “face” or “non-face” based on a set of $N$ original face images of people,
where each image is $R$ pixels high by $C$ pixels wide, i.e., the pixel resolution is $R \times C$. To solve
this, we merge the $N$ training image matrices into a single big matrix by placing them one after another.
Then, we also place the input image matrix $N$ times, one after another, to form another big matrix. After
that, we take the transpose of both big matrices. Subsequently, we apply PCA on the four big matrices and
select $k$ eigenvectors for each. We then determine the similarity of the normal input big matrix with the
normal training big matrix, and of the transposed input big matrix with the transposed training big matrix,
using the selected $k$ features (eigenvectors). Finally, the decision is taken based on the similarity result.
The solution is illustrated with the following steps:


— Step 1: Input the $N$ original images of size $R \times C$.
— Step 2: For each of the $N$ images, convert the image to a matrix of dimension $R \times C$.

a. Step 2.1: Put all the matrices together in one big image-matrix, Train1, like this:

$$\mathrm{Train1} = \begin{bmatrix} \mathrm{ImageMatrix1} \\ \mathrm{ImageMatrix2} \\ \vdots \\ \mathrm{ImageMatrixN} \end{bmatrix}$$

b. Step 2.2: Take the transpose of Train1 and assign it to another matrix, Train2.

Train2 = Transpose(Train1)


— Step 3: For the new image to be classified,

a. Step 3.1: Convert the image to a matrix of dimension $R \times C$ and put it $N$ times together in another
big image-matrix, Test1, like this:

$$\mathrm{Test1} = \begin{bmatrix} \mathrm{NewImageMatrix} \\ \mathrm{NewImageMatrix} \\ \vdots \\ \mathrm{NewImageMatrix} \end{bmatrix}$$

b. Step 3.2: Take the transpose of Test1 and assign it to another matrix, Test2.

Test2 = Transpose(Test1)


— Step 4: For each of the four big image matrices (Train1, Train2, Test1, and Test2),
a. Step 4.1: Apply PCA
b. Step 4.2: Select $k$ eigenvectors with the highest eigenvalues
— Step 5: Determine the similarity of the new image with the existing images using the $k$ extracted
features, i.e., determine the similarity of Test1 with Train1 and of Test2 with Train2 using the $k$ extracted
features.

— Step 6: Classify the new input image as “face” if the similarity is sufficiently high, or as “non-face”
otherwise.
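
The construction of the big matrices in Steps 2 and 3 can be sketched as follows. This is only an illustrative Python sketch: it assumes that placing the matrices one after another means stacking them vertically, as the layout of Train1 suggests, and the function names and the random stand-in images are hypothetical.

```python
import numpy as np


def stack_images(images):
    """Stack image matrices of size R x C vertically into one big matrix."""
    return np.vstack([np.asarray(image, dtype=float) for image in images])


def build_big_matrices(train_images, new_image):
    """Return Train1, Train2, Test1, and Test2 for N training images and one new image."""
    n = len(train_images)
    train1 = stack_images(train_images)    # Step 2.1: (N*R) x C
    train2 = train1.T                      # Step 2.2: C x (N*R)
    test1 = stack_images([new_image] * n)  # Step 3.1: new image repeated N times
    test2 = test1.T                        # Step 3.2
    return train1, train2, test1, test2


# Example with N = 3 random stand-in images of size R x C = 4 x 5.
rng = np.random.default_rng(0)
train_images = [rng.random((4, 5)) for _ in range(3)]
new_image = rng.random((4, 5))
train1, train2, test1, test2 = build_big_matrices(train_images, new_image)
print(train1.shape, train2.shape)  # (12, 5) (5, 12)
```

Repeating the new image $N$ times gives Test1 the same shape as Train1, so that PCA on both yields eigenvectors of the same length.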


4.2. Problem statement-2: Recognition of individual face 
Given a new image, classify it to the most similar image(s) from a set of $N$ original face images for each
of $m$ people, where each image is $R$ pixels high by $C$ pixels wide, i.e., the size is $R \times C$. To
solve this, we merge the $N$ training image matrices of each of the $m$ people into a separate single big
matrix by placing them one after another. Then, we also place the input image matrix $N$ times, one after
another, to form another big matrix. After that, we take the transpose of all big matrices. Subsequently, we
apply PCA on all big matrices and select $k$ eigenvectors for each. We then determine the similarity of the
normal input big matrix with all normal training big matrices, and of the transposed input big matrix with all
transposed training big matrices, using the selected $k$ features. Finally, the decision is taken based on the
similarity result. The solution is illustrated with the following steps:

• Step 1: Input the $N$ original images of size $R \times C$ for each of the $m$ people.

• Step 2: For each of the $N$ images of each of the $m$ people, convert it into a matrix of dimension
$R \times C$.

— Step 2.1: Put the matrices together in a separate big image-matrix, Train3, like this:

$$\mathrm{Train3} = \begin{bmatrix} \mathrm{ImageMatrix1} \\ \mathrm{ImageMatrix2} \\ \vdots \\ \mathrm{ImageMatrixN} \end{bmatrix}$$

— Step 2.2: Take the transpose of Train3 and assign it to another matrix, Train4.

Train4 = Transpose(Train3)


• Step 3: For the new image to be classified,

— Step 3.1: Convert the image to a matrix of dimension $R \times C$ and put it $N$ times together in another
big image-matrix, Test3, like this:

$$\mathrm{Test3} = \begin{bmatrix} \mathrm{NewImageMatrix} \\ \mathrm{NewImageMatrix} \\ \vdots \\ \mathrm{NewImageMatrix} \end{bmatrix}$$

— Step 3.2: Take the transpose of Test3 and assign it to another matrix, Test4.

Test4 = Transpose(Test3)


• Step 4: For each of the big image matrices (training and test, normal and transposed),
— Step 4.1: Apply PCA
— Step 4.2: Select $k$ eigenvectors with the highest eigenvalues

• Step 5: Determine the similarity of the new image with all the existing images of the $m$ people using the
$k$ extracted features, i.e., determine the similarity of Test3 with Train3 and of Test4 with Train4 using
the $k$ extracted features.

• Step 6: Classify the new input image to the most probable image(s) with the highest similarity.


To determine the similarity for both problem statements, the difference between each eigenvector of a
training set and its corresponding eigenvector of the testing set is computed first. The differences obtained
for each eigenvector are then averaged. The new instance is classified as “yes” if the average values are
close to a threshold value, say $a$, which would ideally be around zero (0).
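
A possible reading of this similarity test is sketched below in Python with NumPy. Several details are assumptions introduced here and are not specified in the text: the sign of each eigenvector is fixed by making its largest-magnitude entry positive (eigenvectors are only defined up to sign), closeness to the threshold $a$ is tested with a hypothetical tolerance tol, and for the second problem statement the candidates are ranked by the mean absolute averaged difference. All function names are hypothetical.

```python
import numpy as np


def top_k_eigenvectors(M, k):
    """Return the k eigenvectors (as columns) with the largest eigenvalues."""
    cov = np.cov(np.asarray(M, dtype=float), rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    vecs = eigvecs[:, np.argsort(eigvals)[::-1][:k]]
    # Assumption: resolve the arbitrary sign of each eigenvector by making
    # its largest-magnitude entry positive.
    rows = np.argmax(np.abs(vecs), axis=0)
    return vecs * np.sign(vecs[rows, np.arange(vecs.shape[1])])


def averaged_differences(train, test, k):
    """Subtract corresponding eigenvectors and average each difference."""
    diff = top_k_eigenvectors(train, k) - top_k_eigenvectors(test, k)
    return diff.mean(axis=0)


def is_match(train, test, k, a=0.0, tol=0.05):
    """Classify as "yes" when every average lies within tol of the threshold a."""
    averages = averaged_differences(train, test, k)
    return bool(np.all(np.abs(averages - a) <= tol))


def most_similar(trains, test, k):
    """Index of the training matrix whose averaged differences are closest to zero."""
    scores = [np.abs(averaged_differences(t, test, k)).mean() for t in trains]
    return int(np.argmin(scores))
```

Under these assumptions, the first problem statement would call is_match on the pair (Train1, Test1) and on the pair (Train2, Test2), while the second would pass the Train3 matrices of all $m$ people, and likewise the Train4 matrices, to most_similar.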


5. RESULTS 

The proposed method for face recognition based on principal components has been implemented on the MATLAB
simulation platform. The implemented code has been tested on some common face images captured manually. In
addition, it has been tested on two popular face image databases: ORL and Yale. The ORL database contains 10
different grayscale images of each of 40 distinct subjects. For some of the subjects, the images were taken at
different times and with varying lighting and facial expressions. All images were captured against a dark
homogeneous background with the subjects in an upright, frontal position. The Yale database contains 11
different grayscale images of each of 15 distinct subjects/individuals, one per facial expression or
configuration. Yale has extensions called Extended Yale Face Database A and B. Extended Yale Face Database B
has 38 subjects/individuals and around 64 near-frontal images under different illuminations per subject. For
both databases, images are available at two pixel resolutions: $32 \times 32$ and $64 \times 64$. Some images
from the ORL, Yale, and Extended Yale Face Database B are shown in Figures 1(a)-(c), respectively, while
Table 1 shows the results on different data distributions.





Figure 1. Face databases: (a) sample images from the ORL database, (b) sample images from the Yale 
database, and (c) sample images from the extended Yale face database B 


Table 1. Databases and results

Task                            Database  Total number  Number of            Value of a
                                          of samples    individual subjects
Recognition of a typical face   ORL       400           40                   Theoretically: 0; practically: around 0
                                Yale      2432          38                   Theoretically: 0; practically: around 0
Recognition of individual face  ORL       400           40                   Theoretically: 0; practically: around 0
                                Yale      2432          38                   Theoretically: 0; practically: around 0

For each database, the training and testing sets were created in the manner described in section 4. For the
first problem statement, a random subset of images from every subject was taken to form the training set
Train1, and thus Train2. The remaining images were considered to be the testing set Test1, and thus Test2. For
the second problem statement, a random subset of images of every subject was taken to form the training set
Train3, and thus Train4. Any of the remaining image(s) of the respective subject, upon which the training sets
are formed, was considered to be the testing set Test3, and thus Test4. The recognition results of the
proposed method were quite acceptable, owing especially to the training sets Train2 and Train4, which are the
transposes of the original training sets Train1 and Train3, respectively. The recognition accuracy can
decrease significantly when the training sets contain inconsistent images.
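
The random per-subject split described above could be sketched as follows in Python. The function name, the dictionary layout, the seed, and the number of training images per subject (n_train) are hypothetical, since the text does not state how many images were drawn for training.

```python
import random


def split_per_subject(images_by_subject, n_train, seed=0):
    """Randomly split each subject's images into a training part and a testing part.

    images_by_subject maps a subject identifier to the list of that subject's images.
    """
    rng = random.Random(seed)
    train, test = {}, {}
    for subject, images in images_by_subject.items():
        shuffled = list(images)
        rng.shuffle(shuffled)
        train[subject] = shuffled[:n_train]  # random subset used for training
        test[subject] = shuffled[n_train:]   # remaining images used for testing
    return train, test


# Example with integer stand-ins for images: 40 subjects with 10 images each, as in ORL.
images_by_subject = {s: list(range(10 * s, 10 * s + 10)) for s in range(40)}
train, test = split_per_subject(images_by_subject, n_train=7)
print(len(train[0]), len(test[0]))  # 7 3
```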


6. CONCLUSION AND FUTURE WORK 

The comprehensive overview of DR techniques and the working principle of PCA discussed here can serve as the
ingredients for developing a typical image-data mining application. The proposed method for face recognition
based on principal components can mostly be used in applications where a few images are enough for training.
The proposed approach can be used not only for face recognition but also for the recognition of other kinds
of objects in the same manner. In future, the proposed technique will be applied to the complete ORL and Yale
databases along with other face databases, and its performance will be compared with existing classifiers
based on either machine learning algorithms or other statistical approaches. In addition, an adaptive range of
the threshold $a$ for recognizing an instance will be determined.